FlashGraph: Processing Billion-Node Graphs on an Array of Commodity SSDs
نویسندگان
چکیده
Graph analysis performs many random reads and writes, thus, these workloads are typically performed in memory. Traditionally, analyzing large graphs requires a cluster of machines so the aggregate memory exceeds the graph size. We demonstrate that a multicore server can process graphs with billions of vertices and hundreds of billions of edges, utilizing commodity SSDs with minimal performance loss. We do so by implementing a graph-processing engine on top of a user-space SSD file system designed for high IOPS and extreme parallelism. Our semi-external memory graph engine called FlashGraph stores vertex state in memory and edge lists on SSDs. It hides latency by overlapping computation with I/O. To save I/O bandwidth, FlashGraph only accesses edge lists requested by applications from SSDs; to increase I/O throughput and reduce CPU overhead for I/O, it conservatively merges I/O requests. These designs maximize performance for applications with different I/O characteristics. FlashGraph exposes a general and flexible vertex-centric programming interface that can express a wide variety of graph algorithms and their optimizations. We demonstrate that FlashGraph in semi-external memory performs many algorithms with performance up to 80% of its in-memory implementation and significantly outperforms PowerGraph, a popular distributed in-memory graph engine.
منابع مشابه
BigSparse: High-performance external graph analytics
We present BigSparse, a fully external graph analytics system that picks up where semi-external systems like FlashGraph and X-Stream, which only store vertex data in memory, left off. BigSparse stores both edge and vertex data in an array of SSDs and avoids random updates to the vertex data, by first logging the vertex updates and then sorting the log to sequentialize accesses to the SSDs. This...
متن کاملAn SSD-based eigensolver for spectral analysis on billion-node graphs
Many eigensolvers such as ARPACK and Anasazi have been developed to compute eigenvalues of a large sparse matrix. These eigensolvers are limited by the capacity of RAM. They run in memory of a single machine for smaller eigenvalue problems and require the distributed memory for larger problems. In contrast, we develop an SSD-based eigensolver framework called FlashEigen, which extends Anasazi e...
متن کاملFlashMatrix: Parallel, Scalable Data Analysis with Generalized Matrix Operations using Commodity SSDs
FlashMatrix is a matrix-oriented programming framework for general data analysis with high-level functional programming interface. It scales matrix operations beyond memory capacity by utilizing solid-state drives (SSDs) in non-uniform memory architecture (NUMA). It provides a small number of generalized matrix operations (GenOps) and reimplements a large number of matrix operations in the R fr...
متن کامل$n$-Array Jacobson graphs
We generalize the notion of Jacobson graphs into $n$-array columns called $n$-array Jacobson graphs and determine their connectivities and diameters. Also, we will study forbidden structures of these graphs and determine when an $n$-array Jacobson graph is planar, outer planar, projective, perfect or domination perfect.
متن کاملEstimating graph distance and centrality on shared nothing architectures
We present a parallel toolkit for pairwise distance computation in massive networks. Computing the exact shortest paths between a large number of vertices is a costly operation, and serial algorithms are not practical for billion-scale graphs. We first describe an efficient parallel method to solve the single source shortest path problem on commodity hardware with no shared memory. Using it as ...
متن کاملذخیره در منابع من
با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید
عنوان ژورنال:
دوره شماره
صفحات -
تاریخ انتشار 2015